A Pattern Matching Approach for Redundancy Detection in Bi-lingual and Mono-lingual Corpora

Authors

  • Muneer Ahmad
  • Hassan Mathkour
Abstract

Information in bi-lingual and mono-lingual corpora covering numerous languages may be duplicated. This duplication makes searches over bi-lingual and mono-lingual databases slow and inaccurate. It is essential to structure the sequences so that redundant sequences are reduced; the analysis of bi-lingual and mono-lingual corpus structure is then accurate, which helps in analyzing the features of certain complex and subjective languages. Detecting duplicates also supports selecting the right solution from large corpora. In this paper, we present an algorithm, which we call DSDR, that operates on a set of bi-lingual and mono-lingual corpora and iterates over that set to find all duplications present in it. Once the duplications are found, DSDR removes the duplicated chains and refreshes the databases, which markedly reduces their size. In addition, searches for particular chains in the bi-lingual and mono-lingual corpora become faster and more accurate.
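The abstract does not spell out how DSDR detects duplicates, so the following is only a minimal sketch of the general find-and-remove idea, not the authors' algorithm. It assumes a corpus is a list of text chains, that two chains count as duplicates when they match after whitespace and case normalization, and it swaps in a hash-set lookup in place of whatever pattern matching the paper uses. The names `normalize` and `remove_duplicate_chains` are hypothetical.

```python
from typing import Iterable, List, Tuple


def normalize(chain: str) -> str:
    """Collapse runs of whitespace and case-fold (an assumed notion of 'same chain')."""
    return " ".join(chain.split()).casefold()


def remove_duplicate_chains(chains: Iterable[str]) -> Tuple[List[str], List[int]]:
    """Return (kept chains, indices of removed duplicates), keeping first occurrences."""
    seen = set()      # normalized forms already encountered
    kept = []         # first occurrence of each distinct chain, in original order
    removed = []      # positions of chains dropped as duplicates
    for index, chain in enumerate(chains):
        key = normalize(chain)
        if key in seen:
            removed.append(index)
        else:
            seen.add(key)
            kept.append(chain)
    return kept, removed


# Hypothetical mixed-language sample; not data from the paper.
corpus = ["the cat sat", "The  cat sat", "le chat", "the cat sat", "le chat "]
kept, removed = remove_duplicate_chains(corpus)
```

Here `kept` holds the reduced corpus and `removed` records which entries were dropped, which is roughly the "remove duplicated chains and refresh the database" step described above. A single pass with a set avoids comparing every pair of chains, at the cost of only catching duplicates that normalize to the same string.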

Similar resources

A Comparison in Reading Ability and Achievement between Mono-Lingual and Bilingual Fifth Graders

Y. Adib, Ph.D.; Z. Sharifi; N. Mahmoodi. To compare both the reading ability and academic achievement among Farsi-speaking mono-lingual fifth graders and their bilingual Aazari and Kordi counterparts, three samples of 153, 132, and 145 (total 430) such students from three cities o...

English-Persian Plagiarism Detection based on a Semantic Approach

Plagiarism which is defined as “the wrongful appropriation of other writers’ or authors’ works and ideas without citing or informing them” poses a major challenge to knowledge spread publication. Plagiarism has been placed in four categories of direct, paraphrasing (rewriting), translation, and combinatory. This paper addresses translational plagiarism which is sometimes referred to as cross-li...

Approaching Multi-Lingual Emotion Recognition from Speech - On Language Dependency of Acoustic/Prosodic Features for Anger Recognition

In this paper, we describe experiments on automatic Emotion Recognition using comparable speech corpora collected from real-life American English and German Interactive Voice Response systems. We compute the optimal set of acoustic and prosodic features for mono-, cross- and multi-lingual anger recognition, and analyze the differences. When an emotion recognition system is confronted with a langu...

Robust Cross-Lingual Genre Classification through Comparable Corpora

Classification of texts by genre can benefit applications in Natural Language Processing and Information Retrieval. However, a mono-lingual approach requires large amounts of labeled texts in the target language. Work reported here shows that the benefits of genre classification can be extended to other languages through cross-lingual methods. Comparable corpora – here taken to be collections o...

The 5th Workshop on Building and Using Comparable Corpora

Classification of texts by genre can benefit applications in Natural Language Processing and Information Retrieval. However, a mono-lingual approach requires large amounts of labeled texts in the target language. Work reported here shows that the benefits of genre classification can be extended to other languages through cross-lingual methods. Comparable corpora – here taken to be collections o...

Publication date: 2009